
    Depth Separation for Neural Networks

    Let $f:\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}\to\mathbb{R}$ be a function of the form $f(\mathbf{x},\mathbf{x}') = g(\langle\mathbf{x},\mathbf{x}'\rangle)$ for $g:[-1,1]\to \mathbb{R}$. We give a simple proof that poly-size depth-two neural networks with (exponentially) bounded weights cannot approximate $f$ whenever $g$ cannot be approximated by a low-degree polynomial. Moreover, for many $g$'s, such as $g(x)=\sin(\pi d^3 x)$, the number of neurons must be $2^{\Omega(d\log(d))}$. Furthermore, the result holds w.r.t. the uniform distribution on $\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}$. As many functions of the above form can be well approximated by poly-size depth-three networks with poly-bounded weights, this establishes a separation between depth-two and depth-three networks w.r.t. the uniform distribution on $\mathbb{S}^{d-1}\times \mathbb{S}^{d-1}$.
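    A minimal numerical sketch of the separating target (toy sizes and helper names are assumptions; this only evaluates the function, it is not the approximation argument): sample pairs uniformly from the sphere and evaluate $f(\mathbf{x},\mathbf{x}')=\sin(\pi d^3\langle\mathbf{x},\mathbf{x}'\rangle)$.

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Sample n points uniformly from the unit sphere S^{d-1}."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def hard_target(x, x_prime, d):
    """f(x, x') = g(<x, x'>) with g(t) = sin(pi * d^3 * t)."""
    t = np.sum(x * x_prime, axis=1)            # inner products in [-1, 1]
    return np.sin(np.pi * d**3 * t)

rng = np.random.default_rng(0)
d, n = 20, 5                                   # toy dimension and sample count (assumption)
x, x_prime = sample_sphere(n, d, rng), sample_sphere(n, d, rng)
print(hard_target(x, x_prime, d))
```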

    Complexity Theoretic Limitations on Learning Halfspaces

    We study the problem of agnostically learning halfspaces, which is defined by a fixed but unknown distribution $\mathcal{D}$ on $\mathbb{Q}^n\times \{\pm 1\}$. We define $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$ as the least error of a halfspace classifier for $\mathcal{D}$. A learner who can access $\mathcal{D}$ has to return a hypothesis whose error is small compared to $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D})$. Using the recently developed method of the author, Linial and Shalev-Shwartz, we prove hardness-of-learning results under a natural assumption on the complexity of refuting random $K$-$\mathrm{XOR}$ formulas. We show that no efficient learning algorithm has non-trivial worst-case performance even under the guarantees that $\mathrm{Err}_{\mathrm{HALF}}(\mathcal{D}) \le \eta$ for an arbitrarily small constant $\eta>0$, and that $\mathcal{D}$ is supported on $\{\pm 1\}^n\times \{\pm 1\}$. Namely, even under these favorable conditions its error must be $\ge \frac{1}{2}-\frac{1}{n^c}$ for every $c>0$. In particular, no efficient algorithm can achieve a constant approximation ratio. Under a stronger version of the assumption (where $K$ can be poly-logarithmic in $n$), we can take $\eta = 2^{-\log^{1-\nu}(n)}$ for arbitrarily small $\nu>0$. Interestingly, this is even stronger than the best known lower bounds (Arora et al. 1993, Feldman et al. 2006, Guruswami and Raghavendra 2006) for the case that the learner is restricted to return a halfspace classifier (i.e., proper learning).
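    As a concrete illustration of the quantity $\mathrm{Err}_{\mathrm{HALF}}$ being approximated, the sketch below estimates the 0-1 error of a fixed halfspace $x \mapsto \mathrm{sign}(\langle w, x\rangle)$ on samples from a distribution over $\{\pm 1\}^n\times\{\pm 1\}$. The data model, noise rate, and $w$ are arbitrary placeholders, not the hard distributions from the reduction.

```python
import numpy as np

def halfspace_error(w, X, y):
    """Empirical 0-1 error of the halfspace x -> sign(<w, x>)."""
    preds = np.sign(X @ w)
    preds[preds == 0] = 1                      # break ties arbitrarily
    return np.mean(preds != y)

rng = np.random.default_rng(1)
n, m = 50, 10_000                              # dimension and sample size (toy)
X = rng.choice([-1, 1], size=(m, n))           # features in {+-1}^n
w_star = rng.standard_normal(n)
noise = rng.random(m) < 0.05                   # flip ~5% of the labels (eta ~ 0.05)
y = np.where(noise, -np.sign(X @ w_star), np.sign(X @ w_star))
print("Err_HALF estimate for w_star:", halfspace_error(w_star, X, y))
```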

    Locally Private Learning without Interaction Requires Separation

    We consider learning under the constraint of local differential privacy (LDP). For many learning problems, known efficient algorithms in this model require many rounds of communication between the server and the clients holding the data points. Yet multi-round protocols are prohibitively slow in practice due to network latency and, as a result, currently deployed large-scale systems are limited to a single round. Despite significant research interest, very little is known about which learning problems can be solved by such non-interactive systems. The only lower bound we are aware of is for PAC learning an artificial class of functions with respect to a uniform distribution (Kasiviswanathan et al. 2011). We show that the margin complexity of a class of Boolean functions is a lower bound on the complexity of any non-interactive LDP algorithm for distribution-independent PAC learning of the class. In particular, the classes of linear separators and decision lists require an exponential number of samples to learn non-interactively, even though they can be learned in polynomial time by an interactive LDP algorithm. This gives the first example of a natural problem that is significantly harder to solve without interaction and also resolves an open problem of Kasiviswanathan et al. (2011). We complement this lower bound with a new efficient learning algorithm whose complexity is polynomial in the margin complexity of the class. Our algorithm is non-interactive on labeled samples but still needs interactive access to unlabeled samples. All of our results also apply to the statistical query model and to any model in which the number of bits communicated about each data point is constrained.
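    For intuition about the one-round constraint, here is a standard randomized-response sketch (a textbook local randomizer, not the paper's algorithm): each client sends a single privatized bit, and the server debiases the aggregate without any further interaction.

```python
import numpy as np

def randomize_bit(b, eps, rng):
    """eps-LDP randomized response for a single bit b in {0, 1}."""
    p_keep = np.exp(eps) / (np.exp(eps) + 1.0)
    return b if rng.random() < p_keep else 1 - b

def estimate_mean(reports, eps):
    """Debias the average of randomized bits to estimate the true mean."""
    p = np.exp(eps) / (np.exp(eps) + 1.0)
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(2)
eps, n = 1.0, 100_000                          # privacy budget and client count (toy)
bits = (rng.random(n) < 0.3).astype(int)       # true mean ~ 0.3
reports = [randomize_bit(b, eps, rng) for b in bits]
print("debiased estimate:", estimate_mean(reports, eps))
```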

    The price of bandit information in multiclass online classification

    We consider two scenarios of multiclass online learning of a hypothesis class $H\subseteq Y^X$. In the {\em full information} scenario, the learner is exposed to instances together with their labels. In the {\em bandit} scenario, the true label is not exposed, but rather an indication of whether the learner's prediction is correct or not. We show that the ratio between the error rates in the two scenarios is at most $8\cdot|Y|\cdot \log(|Y|)$ in the realizable case, and $\tilde{O}(\sqrt{|Y|})$ in the agnostic case. The results are tight up to a logarithmic factor and essentially answer an open question from (Daniely et al., Multiclass Learnability and the ERM Principle). We apply these results to the class of $\gamma$-margin multiclass linear classifiers in $\mathbb{R}^d$. We show that the bandit error rate of this class is $\tilde{\Theta}(\frac{|Y|}{\gamma^2})$ in the realizable case and $\tilde{\Theta}(\frac{1}{\gamma}\sqrt{|Y|T})$ in the agnostic case. This resolves an open question from (Kakade et al., Efficient Bandit Algorithms for Online Multiclass Prediction).
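    The toy loop below (illustrative only; the learner is a simple label-frequency tracker, not the algorithms analyzed in the paper) shows the difference between the two feedback models: in the full-information round the true label is revealed, while in the bandit round the learner only observes whether its guess was correct.

```python
import numpy as np

rng = np.random.default_rng(3)
K, T = 10, 20_000                               # number of labels and rounds (toy)
label_probs = rng.dirichlet(np.ones(K))         # fixed, unknown label distribution
labels = rng.choice(K, size=T, p=label_probs)

full_counts = np.zeros(K)                       # full information: sees y_t
bandit_counts = np.zeros(K)                     # bandit: only sees correct/incorrect
mistakes_full = mistakes_bandit = 0
explore = 0.1                                   # exploration rate (assumption)

for y in labels:
    # Full-information learner: predict the most frequent label seen so far.
    pred_f = int(np.argmax(full_counts))
    mistakes_full += int(pred_f != y)
    full_counts[y] += 1                         # the true label is revealed

    # Bandit learner: occasionally explore, learn only from "correct" feedback.
    pred_b = rng.integers(K) if rng.random() < explore else int(np.argmax(bandit_counts))
    correct = (pred_b == y)                     # the only feedback available
    mistakes_bandit += int(not correct)
    if correct:
        bandit_counts[pred_b] += 1

print("full-info mistakes:", mistakes_full, " bandit mistakes:", mistakes_bandit)
```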

    Tight products and Expansion

    In this paper we study a new product of graphs called the {\em tight product}. A graph $H$ is said to be a tight product of two (undirected multi-) graphs $G_1$ and $G_2$ if $V(H)=V(G_1)\times V(G_2)$ and both projection maps $V(H)\to V(G_1)$ and $V(H)\to V(G_2)$ are covering maps. It is not a priori clear when two given graphs have a tight product (in fact, it is $NP$-hard to decide). We investigate the conditions under which this is possible. This perspective yields a new characterization of class-1 $(2k+1)$-regular graphs. We also obtain a new model of random $d$-regular graphs whose second eigenvalue is almost surely at most $O(d^{3/4})$. This construction resembles random graph lifts, but requires fewer random bits.
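    A small sketch of the definition for simple graphs (toy example, not from the paper): it checks that both coordinate projections from a candidate $H$ on $V(G_1)\times V(G_2)$ are covering maps, i.e. the neighborhood of every vertex of $H$ projects bijectively onto the neighborhood of its image.

```python
def is_covering_projection(H_adj, G_adj, coord):
    """Check that projecting H onto coordinate `coord` is a covering map
    (simple graphs): the neighbors of every vertex of H must project
    bijectively onto the neighbors of that vertex's image in G."""
    for v, nbrs in H_adj.items():
        projected = sorted(w[coord] for w in nbrs)
        if projected != sorted(G_adj[v[coord]]):
            return False
    return True

# Toy example: G1 = G2 = K2 (a single edge a-b), and H a perfect matching
# on V(G1) x V(G2); both projections are coverings, so H is a tight product.
G = {"a": ["b"], "b": ["a"]}
H = {("a", "a"): [("b", "b")], ("b", "b"): [("a", "a")],
     ("a", "b"): [("b", "a")], ("b", "a"): [("a", "b")]}

print(is_covering_projection(H, G, 0) and is_covering_projection(H, G, 1))  # True
```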

    Competitive ratio versus regret minimization: achieving the best of both worlds

    We consider online algorithms under both the competitive ratio criterion and the regret minimization one. Our main goal is to build a unified methodology that can guarantee both criteria simultaneously. For a general class of online algorithms, namely any Metrical Task System (MTS), we show that one can simultaneously guarantee the best known competitive ratio and a natural regret bound. For the paging problem we further show an efficient online algorithm (polynomial in the number of pages) with this guarantee. To this end, we extend an existing regret minimization algorithm (specifically, that of Kapralov and Panigrahy) to handle movement cost (the cost of switching between states of the online system). We then show how to use the extended regret minimization algorithm to combine multiple online algorithms. Our end result is an online algorithm that can combine a "base" online algorithm, having a guaranteed competitive ratio, with a range of online algorithms that guarantee a small regret over any interval of time. The combined algorithm guarantees both that its competitive ratio matches that of the base algorithm and that it has low regret over any time interval. As a by-product, we obtain an expert algorithm with a close-to-optimal regret bound on every time interval, even in the presence of switching costs. This result is of independent interest.
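    To make the combination step concrete, here is a generic multiplicative-weights (Hedge) sketch that mixes several "expert" online algorithms and charges a cost whenever it switches between them. It is a standard textbook combiner with placeholder losses and parameters, not the extended Kapralov-Panigrahy algorithm from the paper.

```python
import numpy as np

def hedge_combine(loss_matrix, eta=0.1, switch_cost=1.0, rng=None):
    """Run Hedge over the columns (experts) of loss_matrix (T x N, losses in [0, 1]).
    Returns the combiner's total loss, including a charge for switching experts."""
    if rng is None:
        rng = np.random.default_rng(0)
    T, N = loss_matrix.shape
    weights = np.ones(N)
    total_loss, prev_choice = 0.0, None
    for t in range(T):
        probs = weights / weights.sum()
        choice = rng.choice(N, p=probs)           # follow one expert this round
        total_loss += loss_matrix[t, choice]
        if prev_choice is not None and choice != prev_choice:
            total_loss += switch_cost             # movement cost between "states"
        prev_choice = choice
        weights *= np.exp(-eta * loss_matrix[t])  # exponential weight update
    return total_loss

rng = np.random.default_rng(4)
losses = rng.random((1000, 3))                    # placeholder losses for 3 algorithms
print("combined loss (with switching cost):", hedge_combine(losses, rng=rng))
```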

    Complexity theoretic limitations on learning DNF's

    Using the recently developed framework of [Daniely et al., 2014], we show that, under a natural assumption on the complexity of refuting random K-SAT formulas, learning DNF formulas is hard. Furthermore, the same assumption implies the hardness of learning intersections of $\omega(\log(n))$ halfspaces, agnostically learning conjunctions, as well as virtually all (distribution-free) learning problems that were previously shown hard (under complexity assumptions).
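    For reference, the assumption concerns refuting random K-SAT; the sketch below merely generates such a random formula (uniformly random clauses over $n$ variables), which is the object the assumption refers to, not a refutation algorithm.

```python
import numpy as np

def random_ksat(n, m, k, rng):
    """Sample m clauses, each over k distinct variables with uniform random signs.
    A clause is a list of signed literals: +i means x_i, -i means NOT x_i."""
    formula = []
    for _ in range(m):
        vars_ = rng.choice(np.arange(1, n + 1), size=k, replace=False)
        signs = rng.choice([-1, 1], size=k)
        formula.append(list(signs * vars_))
    return formula

rng = np.random.default_rng(5)
print(random_ksat(n=10, m=5, k=3, rng=rng))
```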

    Optimal Learners for Multiclass Problems

    The fundamental theorem of statistical learning states that for binary classification problems, any Empirical Risk Minimization (ERM) learning rule has close to optimal sample complexity. In this paper we seek a generic optimal learner for multiclass prediction. We start by proving a surprising result: a generic optimal multiclass learner must be improper, namely, it must have the ability to output hypotheses which do not belong to the hypothesis class, even though it knows that all the labels are generated by some hypothesis from the class. In particular, no ERM learner is optimal. This brings us back to the fundamental question of "how to learn?" We give a complete answer to this question via a new analysis of the one-inclusion multiclass learner of Rubinstein et al. (2006), showing that its sample complexity is essentially optimal. Then, we turn to study the popular hypothesis class of generalized linear classifiers. We derive optimal learners that, unlike the one-inclusion algorithm, are computationally efficient. Furthermore, we show that the sample complexity of these learners is better than the sample complexity of the ERM rule, thus settling in the negative an open question due to Collins (2005).
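    As a point of reference for the ERM rule discussed above, here is a minimal ERM learner for a finite multiclass hypothesis class (a toy placeholder class, unrelated to the one-inclusion learner or the paper's constructions): it returns the hypothesis with the smallest empirical error and, by construction, never outputs anything outside the class (i.e., it is proper).

```python
def erm(hypotheses, X, y):
    """Return the hypothesis in the (finite) class with minimal empirical error."""
    def emp_error(h):
        return sum(h(x) != label for x, label in zip(X, y)) / len(y)
    return min(hypotheses, key=emp_error)

# Toy class: constant multiclass predictors over the labels {0, 1, 2}.
hypotheses = [lambda x, c=c: c for c in range(3)]
X = [0.1, 0.4, 0.9, 0.7, 0.3]
y = [2, 2, 1, 2, 0]
h_hat = erm(hypotheses, X, y)
print("ERM picks the constant predictor:", h_hat(0.0))   # label 2 (most frequent)
```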

    Memorizing Gaussians with no over-parameterization via gradient descent on neural networks

    We prove that a single step of gradient descent over a depth-two network with $q$ hidden neurons, starting from orthogonal initialization, can memorize $\Omega\left(\frac{dq}{\log^4(d)}\right)$ independent and randomly labeled Gaussians in $\mathbb{R}^d$. The result is valid for a large class of activation functions, which includes the absolute value.
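    The sketch below mirrors the setting numerically (a toy run, not the paper's proof or exact scaling): random labeled Gaussians, a depth-two network with absolute-value activation and orthogonal first-layer initialization, one full-batch gradient step on the first layer under squared loss, and a report of how many training labels are fit. Step size, the fixed random-sign second layer, and the sizes are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
d, q, m = 200, 50, 400                          # dimension, width, samples (toy)
X = rng.standard_normal((m, d))                 # random Gaussians in R^d
y = rng.choice([-1.0, 1.0], size=m)             # random labels

# Orthogonal first layer, fixed random-sign second layer.
W = np.linalg.qr(rng.standard_normal((d, q)))[0].T   # q x d, orthonormal rows
v = rng.choice([-1.0, 1.0], size=q) / np.sqrt(q)

def forward(W):
    return np.abs(X @ W.T) @ v                  # f(x) = sum_i v_i * |<w_i, x>|

# One full-batch gradient step on the squared loss, first layer only.
A = X @ W.T
residual = forward(W) - y                       # shape (m,)
grad_W = (2.0 / m) * ((residual[:, None] * np.sign(A) * v[None, :]).T @ X)
lr = 50.0                                       # step size chosen by hand (assumption)
W_new = W - lr * grad_W

acc = np.mean(np.sign(forward(W_new)) == y)
print("fraction of training labels fit after one step:", acc)
```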

    Neural Networks Learning and Memorization with (almost) no Over-Parameterization

    Many results in recent years established polynomial-time learnability of various models via neural network algorithms. However, unless the model is linearly separable or the activation is a polynomial, these results require very large networks -- much larger than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth-two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with near-optimal network size, sample complexity, and runtime. In particular, we show that SGD on a depth-two network with $\tilde{O}\left(\frac{m}{d}\right)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ randomly labeled points in $\mathbb{S}^{d-1}$.
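    A small companion sketch (a toy experiment under assumed hyperparameters, not the paper's algorithm or guarantees): plain minibatch SGD on a depth-two ReLU network with roughly $m/d$ hidden neurons, trained to fit $m$ randomly labeled points on the sphere, reporting the resulting training accuracy.

```python
import numpy as np

rng = np.random.default_rng(7)
d, m = 50, 1000
q = m // d                                      # roughly m/d hidden neurons
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the sphere S^{d-1}
y = rng.choice([-1.0, 1.0], size=m)             # random labels

W = rng.standard_normal((q, d)) / np.sqrt(d)
v = rng.choice([-1.0, 1.0], size=q)

def forward(Xb, W, v):
    return np.maximum(Xb @ W.T, 0.0) @ v        # depth-two ReLU network

lr, epochs, batch = 0.05, 200, 50               # hyperparameters (assumptions)
for _ in range(epochs):
    for idx in np.array_split(rng.permutation(m), m // batch):
        Xb, yb = X[idx], y[idx]
        A = Xb @ W.T
        r = np.maximum(A, 0.0) @ v - yb         # squared-loss residual
        mask = (A > 0).astype(float)
        grad_W = (2.0 / len(idx)) * ((r[:, None] * mask * v[None, :]).T @ Xb)
        grad_v = (2.0 / len(idx)) * (np.maximum(A, 0.0).T @ r)
        W -= lr * grad_W
        v -= lr * grad_v

print("training accuracy:", np.mean(np.sign(forward(X, W, v)) == y))
```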